Crowdsourcing an OCR Gold Standard for a German and French Heritage Corpus
Abstract
Crowdsourcing approaches for post-correction of optical character recognition (OCR) output have been successfully applied to several historic text collections. We report on our crowd-correction platform Kokos, which we built to improve the OCR quality of the digitized 19th-century yearbooks of the Swiss Alpine Club (SAC). This multilingual heritage corpus consists of Alpine texts written mainly in German and French, all typeset in Antiqua. Finding and engaging volunteers to correct a large number of pages into high-quality text requires a carefully designed user interface, an easy-to-use workflow, and continuous effort to keep the participants motivated. More than 180,000 characters on about 21,000 pages were corrected by volunteers in about 7 months, yielding an OCR gold standard with a systematically evaluated accuracy of 99.7% at the word level. The crowdsourced OCR gold standard and the corresponding original OCR output from ABBYY FineReader 7 for each page are available as a resource. The scanned images (300 dpi) of all pages are also included to facilitate tests with other OCR software.
Similar resources
From Historic Books to Annotated XML: Building a Large Multilingual Diachronic Corpus
This paper introduces our approach towards annotating a large heritage corpus, which spans over 100 years of alpine literature. The corpus consists of over 16.000 articles from the yearbooks of the Swiss Alpine Club, 60% of which represent German texts, 38% French, 1% Italian and the remaining 1% Swiss German and Romansh. The present work describes the inherent difficulties in processing a mult...
In Search of a Gold Standard in Studies of Deception
In this study, we explore several popular techniques for obtaining corpora for deception research. Through a survey of traditional as well as non-gold standard creation approaches, we identify advantages and limitations of these techniques for web-based deception detection and offer crowdsourcing as a novel avenue toward achieving a gold standard corpus. Through an in-depth case study of online h...
Creating Multilingual Gold Standard Corpora for Biomedical Concept Recognition
We describe our approach to create gold standard corpora for biomedical concept recognition in multiple languages, including English, French, German, Spanish, and Dutch. The annotations are based on a subset of the Unified Medical Language System and cover a wide variety of semantic groups.
Strategies for Reducing and Correcting OCR Errors
In this paper we describe our efforts in reducing and correcting OCR errors in the context of building a large multilingual heritage corpus of Alpine texts which is based on digitizing the publications of various Alpine clubs. We have already digitized the yearbooks of the Swiss Alpine Club from its start in 1864 until 1995 with more than 75,000 pages resulting in 29 million running words. Sinc...
Challenges in Building a Multilingual Alpine Heritage Corpus
This paper describes our efforts to build a multilingual heritage corpus of alpine texts. Currently we digitize the yearbooks of the Swiss Alpine Club which contain articles in French, German, Italian and Romansch. Articles comprise mountaineering reports from all corners of the earth, but also scientific topics such as topography, geology or glacierology as well as occasional poetry and lyrics...
Publication date: 2016